Purdue University: Vaccinated
VAST 2010 Challenge
Hospitalization Records -  Characterization of Pandemic Spread

Authors and Affiliations:

Abish Malik, Purdue University [Primary Contact], amalik@purdue.edu

Shehzad Afzal, Purdue University [Primary Contact], safzal@purdue.edu

Erin Hodgess, University of Houston – Downtown [Faculty Advisor], HodgessE@uhd.edu

David S. Ebert, Purdue University [Faculty Advisor], ebertd@purdue.edu

Ross Maciejewski, Purdue University [Faculty Advisor], rmacieje@purdue.edu

Tool:

Our work utilizes and extends the Purdue University Visual Analytics Center’s system for healthcare analysis.  This system is designed to facilitate the visual analytics process on categorical spatiotemporal data.  As a preprocessing step, emergency department chief complaints are categorized with the University of Pittsburgh’s CoCo classifier [1].  The tool we developed uses linked geographic and temporal views for exploring disease spread.  Underlying the views, we apply analytical algorithms based on typical control charting methods for time-series anomaly detection over the categorized chief complaints.  These algorithms feed back into the visualization as glyphs within the time series, denoting when anomalous temporal shifts have occurred in the data.  These shifts represent deviations from the expected values and indicate a need to investigate the current status of the data.  System components include a map view for spatial data visualization, line graph views for visualizing temporal health signals of categorized chief complaints and death records, a stacked graph view for analyzing record linkages between patient visits and deaths, and a statistical summary window providing details on the illnesses by age, gender and chief complaint.  All views are linked to an interactive time slider for animation, exploration and analysis.

 

[1] Chapman, Wendy W, Dowling, John N and Wagner, Michael M (2005), "Classification of emergency department chief complaints into 7 syndromes: a retrospective analysis of 527,228 patients", Ann Emerg Med, 46, 5: 445--455.  Downloaded from http://prdownloads.sourceforge.net/openrods/CoCo_batch_3.zip?download

 

Video:

 

VACCINATED.avi (File tested under Quicktime)

 

ANSWERS:


MC2.1: Analyze the records you have been given to characterize the spread of the disease.  You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease.  Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak.  They are looking for visualization tools that will save them analysis time so they can react quickly.

Given a set of hospital admittance records, we first categorize the data into syndromes (Botulinic, Constitutional, Gastrointestinal, Hemorrhagic, Neurological, Rash, Respiratory, and Other) using the CoCo classifier [1].  The categorized data is then ingested into our system, and time series plots of the categories are analyzed using an Exponentially Weighted Moving Average (EWMA) control chart with a 99% confidence interval upper bound.  We visualize each syndromic category, by country, as a line graph in our system.  EWMA is automatically applied to any temporal plot generated by our system, and alerts are visualized as red circles on the time series plots.  Visually exploring the hospital admittance records for each category and country, we find that two categories (gastrointestinal, shown in Figure 1, and hemorrhagic) have a strikingly large number of temporally contiguous alerts in eight of the eleven given spatial locations (Aleppo, Colombia, Iran, Lebanon, Nairobi, Saudi Arabia, Venezuela and Yemen).  Only Thailand, Turkey and Karachi seem unaffected.
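The alerting step described above can be illustrated with a minimal sketch of an EWMA control chart with a one-sided upper bound.  This is not our system's implementation: the smoothing weight (lam=0.3), the 7-day baseline window, and the use of the one-sided normal quantile 2.326 for the 99% bound are all assumptions chosen for illustration, as is the function name.

```python
import math
from statistics import mean, stdev

def ewma_alerts(counts, baseline_days=7, lam=0.3, z_crit=2.326):
    """Return indices of days whose EWMA exceeds the upper control limit.

    counts        -- daily patient counts for one syndrome in one location
    baseline_days -- leading days used to estimate mean and spread (assumed)
    lam           -- EWMA smoothing weight (assumed)
    z_crit        -- one-sided 99% normal quantile (assumed)
    """
    baseline = counts[:baseline_days]
    mu = mean(baseline)
    sigma = stdev(baseline)
    alerts = []
    z = mu  # the EWMA starts at the baseline mean
    for t, x in enumerate(counts[baseline_days:], start=1):
        z = lam * x + (1 - lam) * z
        # Time-varying half-width of the EWMA control band
        width = z_crit * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
        )
        if z > mu + width:
            alerts.append(baseline_days + t - 1)
    return alerts

# Hypothetical daily counts: a quiet baseline followed by a sharp rise
example = [10, 12, 11, 9, 10, 11, 10, 10, 11, 40, 80, 120]
alert_days = ewma_alerts(example)
```

Each returned index corresponds to one red circle on the time series plot; a run of consecutive indices is what the text calls temporally contiguous alerts.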


Fig1.jpg

Figure 1: Line graphs of the gastrointestinal + hemorrhagic category plotted as patient counts per day by country.


Exploring the gastrointestinal and hemorrhagic categories, we drill down into the hospital admittance records in order to determine disease symptoms.  Our toolkit automatically bins patient admittance free text fields.  Our system provides a linked summary statistics window showing the top five most numerous complaints per day by selected category.  By interactively scrolling through the alerts, we are able to see that the most common hemorrhagic and gastrointestinal admittance record fields corresponding to alerts are: 'vomiting', 'abdominal pain', 'diarrhea', 'fever' and 'nose bleed'.  Figure 2 shows our summary statistics windows containing admittance record information and other automatically calculated statistical summaries of the data.
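As a rough illustration of the summary statistics described above, the top five complaints for a given day and category can be obtained by counting the free text fields.  The record layout (day, category, complaint text) and the function name are hypothetical, and the normalization shown (trimming and lowercasing) is an assumption rather than the binning our toolkit performs.

```python
from collections import Counter

def top_complaints(records, day, category, n=5):
    """Return the n most frequent complaint strings for one day and category.

    records -- iterable of (day, category, complaint_text) tuples (hypothetical layout)
    """
    counts = Counter(
        text.strip().lower()
        for rec_day, rec_category, text in records
        if rec_day == day and rec_category == category
    )
    return counts.most_common(n)

# Hypothetical records for a single day
sample = [
    ("2009-05-10", "Gastrointestinal", "Vomiting"),
    ("2009-05-10", "Gastrointestinal", "vomiting "),
    ("2009-05-10", "Gastrointestinal", "diarrhea"),
    ("2009-05-10", "Hemorrhagic", "nose bleed"),
    ("2009-05-11", "Gastrointestinal", "abdominal pain"),
]
top = top_complaints(sample, "2009-05-10", "Gastrointestinal")
```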


Figure 2 

Figure 2: Summary statistic view.


We further investigate the deaths by exploring the overall categorized death trends in our linked stacked graph view, Figure 3.  Deaths, by category, totaled over all selected countries, are visualized on a single plot.  As we scroll through time, a texture overlay is plotted to show links between the current day’s patients and the future times at which any subset of these patients died.  From this, we can determine the approximate amount of time it takes a patient showing signs of gastrointestinal or hemorrhagic syndromes to die.  We find that, from hospital admittance to reported death, the average time for a patient to succumb to the illness is ~8 days.  Here, we also notice that a large number of patients categorized as ‘other’ are dying.  We go back to the summary statistics window, explore the primary admittance text fields associated with the ‘other’ category, and find that patients in this category have symptoms including: 'nausea', 'cough', 'headache vomiting', 'back pain', and 'abd pain'.  This indicates problems in handling misspelled fields during classification; we therefore refine our disease symptoms to include signals misclassified as ‘other’.
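The record linkage behind the ~8 day figure amounts to matching each admitted patient to a later death record and averaging the elapsed days.  A minimal sketch follows, assuming admissions and deaths are keyed by a shared patient identifier; the dictionaries, the identifier scheme, and the function name are hypothetical.

```python
from datetime import date
from statistics import mean

def mean_days_to_death(admissions, deaths):
    """Average days from admittance to death over patients found in both sets.

    admissions -- dict mapping patient id to admittance date (hypothetical layout)
    deaths     -- dict mapping patient id to date of death (hypothetical layout)
    Returns None when no patient can be linked.
    """
    lags = [
        (deaths[pid] - admitted).days
        for pid, admitted in admissions.items()
        if pid in deaths
    ]
    return mean(lags) if lags else None

# Hypothetical linked records: patient 2 survives and is ignored
admissions = {1: date(2009, 5, 7), 2: date(2009, 5, 8), 3: date(2009, 5, 9)}
deaths = {1: date(2009, 5, 15), 3: date(2009, 5, 17)}
average_lag = mean_days_to_death(admissions, deaths)
```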

 
Figure 3

Figure 3: A stacked graph view of total deaths from all selected countries.


Next, we analyze the time between the onset and peak of the disease.  In the line graph visualization, the user interactively measures the time from the first alert to the approximate peak of the outbreak.  The user clicks anywhere within the line graph window and drags the mouse.  This creates a tape measure that reports the distance between two points along the temporal axis.  Use of this tool is shown in Figure 4; with it, the user estimates that the time from the onset of the disease to its peak is approximately 10-12 days.  Then, we extend the tape measure over the entire series of alerts and see the approximate mortality and attack rates for each country.  For the mortality rate, we sum the number of patients who died after presenting gastrointestinal or hemorrhagic symptoms over this time period and divide by the total number of patients seen with gastrointestinal or hemorrhagic syndromes over the same period.  This number is only an approximate mortality rate, as we ignore patients categorized as other who may still be showing signs of infection.  The average mortality rate is ~10%.  The attack rate is calculated as the number of patients with the selected syndromes divided by the total number of patients seen in that time span.  The average attack rate is ~26%.
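The two rates defined above reduce to simple ratios over the span selected with the tape measure.  The sketch below restates those definitions; the function name and the example counts are hypothetical and chosen only so that the results land on the averages quoted in the text.

```python
def outbreak_rates(selected_patients, selected_deaths, total_patients):
    """Return (mortality_rate, attack_rate) for a selected time span.

    selected_patients -- patients seen with gastrointestinal or hemorrhagic syndromes
    selected_deaths   -- deaths among those patients
    total_patients    -- all patients seen in the same span
    """
    mortality_rate = selected_deaths / selected_patients
    attack_rate = selected_patients / total_patients
    return mortality_rate, attack_rate

# Hypothetical counts, not taken from the challenge data
mortality, attack = outbreak_rates(
    selected_patients=260, selected_deaths=26, total_patients=1000
)
```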


Figure4.jpg


Figure 4: Using the ‘tape measure’ tool for calculating attack and mortality rates.


Finally, we analyze the death total by syndrome for each country to further strengthen our hypothesis.  We utilize our line graph view for deaths per day, with an applied EWMA.  We see a roughly Gaussian trend in deaths for the eight previously mentioned locations, and we appear to pick up a similarly large death toll in Karachi.  We hypothesize that, for the Karachi records, the time span is too short to adequately model the disease.  This results in a false negative in the syndromic analysis, as the death curve seems to indicate the presence of an outbreak, as shown by the alert noted in Figure 5 (the first red dot on the left).  However, Turkey and Thailand show no signs of outbreak.


Figure5.jpg

Figure 5:  A line graph view of deaths per day by country.


To summarize, we find that this particular disease outbreak likely corresponds to the following hospital admittance text fields: 'vomiting', 'abdominal pain', 'diarrhea', 'fever' and 'nose bleed', with other potential indicators being 'back pain'.  Further, from initial detection of the disease, it appears that the outbreak peaks within 10-12 days.  Mortality rates are ~10%, with patients succumbing to the illness over the course of ~8 days, and attack rates are ~26%.  It also appears that the pandemic cycle lasted approximately 22 days from the first alert.  Only Turkey and Thailand appear unaffected in this data set, with Karachi death totals implying the need for further investigation.


MC2.2:  Compare the outbreak across cities.  Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities.  Identify any anomalies you found.

In comparing outbreaks across countries and cities, we again look at alerts generated through the EWMA control charting method.  Previously, we had determined that the syndromic categories of interest were hemorrhagic and gastrointestinal.  To compare outbreaks across cities and countries, we utilize the map view visualization component of our system.  As syndromic time series data is often noisy, we aggregate the data by week and explore it using a choropleth map view where color maps to the percentage of patients with hemorrhagic or gastrointestinal syndromes out of the total number of patients seen.  Figure 1 shows a screen shot of our map view visualization component.  As we scroll through time, we can hypothesize that the disease progresses from Nairobi City in Kenya to Aleppo in Syria, Lebanon and Iran.  From there, it seems to progress to other Middle Eastern countries (Saudi Arabia and Yemen) and then to South America, appearing in Colombia and Venezuela.
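The weekly aggregation that drives the choropleth colors can be sketched as follows.  The input layout (location, day, selected count, total count) and the function name are hypothetical, and bucketing by ISO week is an assumption; the text states only that the data is aggregated by week.

```python
from collections import defaultdict
from datetime import date

def weekly_syndrome_share(daily_counts):
    """Percent of patients with the selected syndromes, per location and ISO week.

    daily_counts -- iterable of (location, day, selected_count, total_count)
    Returns a dict keyed by (location, iso_year, iso_week).
    """
    sums = defaultdict(lambda: [0, 0])
    for location, day, selected, total in daily_counts:
        iso_year, iso_week, _ = day.isocalendar()
        bucket = sums[(location, iso_year, iso_week)]
        bucket[0] += selected
        bucket[1] += total
    # Weeks with no patients at all are skipped to avoid dividing by zero
    return {key: 100.0 * s / t for key, (s, t) in sums.items() if t > 0}

# Hypothetical daily counts spanning two ISO weeks
rows = [
    ("Nairobi City", date(2009, 5, 4), 10, 100),
    ("Nairobi City", date(2009, 5, 6), 30, 100),
    ("Nairobi City", date(2009, 5, 11), 50, 100),
]
shares = weekly_syndrome_share(rows)
```

Summing the counts before dividing, rather than averaging daily percentages, keeps low-volume days from dominating the weekly value.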


Fig1-P2.jpg

Figure 1: Choropleth map view of countries as a percentage of the patients showing gastrointestinal or hemorrhagic illnesses out of the total number of patients aggregated by week.  When only city data is available for a country, we visualize the country color based on the syndromic percentages from a given city.


We then utilize the line graph views of syndromes by country to find the earliest alerts generated by EWMA and thereby the approximate timing of the outbreaks by day.  In Nairobi City, an initial alert is generated on 5-7-2009, Aleppo on 5-10-2009, Iran 5-11-2009, Lebanon 5-12-2009, Saudi Arabia 5-11-2009, Yemen 5-10-2009, Colombia 5-11-2009, and Venezuela 5-11-2009.  Next, we utilize the line graph views of death by country, and we again see that the trend of deaths in Karachi matches the trend of deaths found in cities where statistically significant alerts are generated.  From this set of data, we hypothesize that the origin of the outbreak is likely to have been Karachi.  We see the first major increase in deaths in Karachi on 5-6-2009.  Given the ~8 day lag observed between hospital admittance and death, we can hypothesize that an outbreak may have begun in Karachi as early as 4-28-2009.
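The ordering of first alerts and the back-dating of the Karachi onset can be checked with a few lines of date arithmetic.  The alert dates are those listed above, read as month-day-year; the function name is hypothetical, and using the ~8 day admittance-to-death average as a fixed offset is a simplifying assumption.

```python
from datetime import date, timedelta

# First EWMA alert per location, as listed in the text
first_alerts = {
    "Nairobi City": date(2009, 5, 7),
    "Aleppo": date(2009, 5, 10),
    "Iran": date(2009, 5, 11),
    "Lebanon": date(2009, 5, 12),
    "Saudi Arabia": date(2009, 5, 11),
    "Yemen": date(2009, 5, 10),
    "Colombia": date(2009, 5, 11),
    "Venezuela": date(2009, 5, 11),
}

# Locations from earliest to latest first alert (ties keep listing order)
alert_order = sorted(first_alerts, key=first_alerts.get)

def estimated_onset(first_death_rise, lag_days=8):
    """Back-date an outbreak from the first rise in deaths by a fixed lag."""
    return first_death_rise - timedelta(days=lag_days)

karachi_onset = estimated_onset(date(2009, 5, 6))
```

With these inputs, alert_order begins with Nairobi City and ends with Lebanon, and karachi_onset falls on 4-28-2009, ahead of every syndromic alert.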


To further analyze the number of people infected and the recovery ability of the city, we began exploring the data as a function of the underlying population.  When analyzing the data in this manner, we began discovering other anomalies.  While the underlying city and country populations are ingested into our system, the following analyses were done by hand simply as a means to quickly summarize the data set with respect to the city/country population (although attack and mortality rates are found using our system).


The city of Karachi (population of 11.6 million), while showing no syndromic alerts, experienced a large population loss during this study.  There were 165,606 total records of deaths, indicating that 1.4% of the city population died in a 14 day period.  Based on the sudden rise in deaths, it is likely that this city had difficulty coping with the pandemic and stopped reporting data.


Nairobi City has a population of about 3.5 million people.  From the death records, we see that 43,719 people died during the testing phase, which is roughly 1.25% of the city.  The infection rate was 31%.  Thus Nairobi, as one of the initial points of the pandemic, is heavily impacted.


The population of Aleppo is approximately 1.6 million people, and we received approximately 1 million hospital records over a 32 day period, indicating that ~63% of the population of Aleppo recorded a hospital visit during this time period.  Further, when analyzing the number of deaths over this time period, we find that 4.9% of the population died during that time frame.  As in Karachi, the sudden rise in deaths likely crippled city infrastructure, thus halting the reporting of data.
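The population percentages quoted for Karachi, Nairobi City and Aleppo follow directly from the counts and populations given in the text.  The short check below reproduces that hand arithmetic; the function name is hypothetical, and the populations are the approximate figures stated above.

```python
def percent_of_population(count, population):
    """Express a record count as a percentage of the underlying population."""
    return 100.0 * count / population

# Counts and approximate populations as stated in the text
karachi_deaths_pct = percent_of_population(165_606, 11_600_000)
nairobi_deaths_pct = percent_of_population(43_719, 3_500_000)
aleppo_visits_pct = percent_of_population(1_000_000, 1_600_000)
```

These evaluate to roughly 1.43%, 1.25% and 62.5%, consistent with the rounded values reported above.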
 

Most of the remaining countries displayed a hospital admittance rate of around 1.0%.  However, Saudi Arabia had a rate of 3.7%, while Lebanon's rate was 10.5%.  The infection rate from gastrointestinal and hemorrhagic was 22% for Saudi Arabia and 24% for Lebanon.


Yemen had some interesting findings in that there were alerts for every illness category except respiratory.  The infection rate was 26%, which is right at the mean.  There are some duplicates in the chief complaints across the categories, which may account for the widespread symptoms.